Skip to content

Conversation

@houqp
Copy link
Member

@houqp houqp commented Jan 19, 2021

Other than basic type support, this PR also covers the following cases:

  • List type
  • nested struct
  • nested list
  • list of struct
  • struct with list field

@github-actions
Copy link

@houqp houqp marked this pull request as draft January 19, 2021 04:36
@codecov-io
Copy link

codecov-io commented Jan 19, 2021

Codecov Report

Merging #9256 (f200959) into master (3e47777) will decrease coverage by 0.01%.
The diff coverage is 84.58%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #9256      +/-   ##
==========================================
- Coverage   81.89%   81.87%   -0.02%     
==========================================
  Files         215      216       +1     
  Lines       52988    53097     +109     
==========================================
+ Hits        43392    43474      +82     
- Misses       9596     9623      +27     
Impacted Files Coverage Δ
rust/arrow/src/array/array_struct.rs 88.43% <ø> (ø)
rust/arrow/src/array/equal/utils.rs 75.49% <0.00%> (ø)
rust/arrow/src/bytes.rs 53.12% <ø> (ø)
rust/arrow/src/compute/kernels/aggregate.rs 74.93% <ø> (ø)
rust/arrow/src/ipc/gen/SparseTensor.rs 0.00% <0.00%> (ø)
rust/arrow/src/ipc/gen/Tensor.rs 0.00% <0.00%> (ø)
rust/benchmarks/src/bin/nyctaxi.rs 0.00% <ø> (ø)
rust/benchmarks/src/bin/tpch.rs 6.97% <0.00%> (ø)
rust/datafusion/src/datasource/memory.rs 79.75% <0.00%> (ø)
rust/datafusion/src/logical_plan/operators.rs 75.00% <ø> (ø)
... and 136 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5665a0c...f200959. Read the comment docs.

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gave this one a quick look, and it looks good so far. I think the nulls are not accounted for.

I left some comments accordingly.

}

pub fn as_list_array<S: OffsetSizeTrait>(arr: &ArrayRef) -> &GenericListArray<S> {
pub fn as_generic_list_array<S: OffsetSizeTrait>(arr: &ArrayRef) -> &GenericListArray<S> {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

breaking change to make naming more consistent with array type.

@houqp houqp marked this pull request as ready for review February 1, 2021 05:50
@houqp houqp requested a review from jorgecarleitao February 1, 2021 05:51
@houqp houqp changed the title WIP: ARROW-11310: [Rust] implement JSON writer ARROW-11310: [Rust] implement JSON writer Feb 1, 2021
@houqp
Copy link
Member Author

houqp commented Feb 1, 2021

@jorgecarleitao @nevi-me ready for review.

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, @houqp ! This looks super clean and IMO ready to go.

I would add a note to the readme where we discuss json readers. IMO this is a major feature!

@jorgecarleitao
Copy link
Member

cc @andygrove @alamb @nevi-me @sunchao

Copy link
Contributor

@nevi-me nevi-me left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the implementation @houqp, minor comments

struct_array_to_jsonmap_array(as_struct_array(array), array.len());
jsonmaps.into_iter().map(Value::Object).collect()
}
_ => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there'd be demand for dictionary support, but we can add that separately

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I agree -- on both counts :)

Copy link
Member Author

@houqp houqp Feb 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, i was planning to add dictionary support, but this patch set grew larger than I expected, so I decided to leave that to a follow up PR for easier review.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a quick skim at this PR and it looks nice to me. Thank you @houqp

@houqp houqp requested a review from alamb February 2, 2021 05:35
@alamb
Copy link
Contributor

alamb commented Feb 2, 2021

I plan to review this PR later today and merge unless I hear otherwise

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice work @houqp

@alamb alamb closed this in 6c880b2 Feb 2, 2021
@houqp houqp deleted the qp_json_writer branch February 3, 2021 00:44
alamb added a commit that referenced this pull request Feb 26, 2021
…s newline delimited json streams

## Rationale
Currently the Arrow json writer makes JSON that looks like this (one record per line):
```json
{"foo":1}
{"bar":1}
```

Which is not technically valid JSON, which would look something like this:

```
[
  {"foo":1},
  {"bar":1}
]
```

## New Features
This PR parameterizes the JSON writer so it can write in either format. Note I needed this feature for in IOx, in influxdata/influxdb_iox#870, and I want to propose contributing it back here).

## Other Changes:
1. Added the function `into_inner()` to retrieve the inner writer from the JSON writer, following the model of the Rust standard library (e.g. [BufReader::into_inner](https://doc.rust-lang.org/std/io/struct.BufReader.html#method.into_inner)

2. Per Rust standard pattern, I change the JSON writer so that it doesn't add any Buffering (via `BufReader`) itself, and instead allows the caller the choice of what type of buffering, if any, is needed.

3. Added / cleaned up a bunch of documentation and comments.

## Questions
I went with parameterizing the `Writer` output as a trait rather than runtime dispatch, for performance. This shouldn't have backwards compatible issues Given the writer has not yet been released yet (introduced by @houqp #9256)

However would people prefer a single `Writer` that took an `Options` struct or something to determine how it wrote out data?

Closes #9575 from alamb/alamb/json_arrays

Authored-by: Andrew Lamb <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
alamb added a commit to apache/arrow-rs that referenced this pull request Apr 20, 2021
…s newline delimited json streams

## Rationale
Currently the Arrow json writer makes JSON that looks like this (one record per line):
```json
{"foo":1}
{"bar":1}
```

Which is not technically valid JSON, which would look something like this:

```
[
  {"foo":1},
  {"bar":1}
]
```

## New Features
This PR parameterizes the JSON writer so it can write in either format. Note I needed this feature for in IOx, in influxdata/influxdb_iox#870, and I want to propose contributing it back here).

## Other Changes:
1. Added the function `into_inner()` to retrieve the inner writer from the JSON writer, following the model of the Rust standard library (e.g. [BufReader::into_inner](https://doc.rust-lang.org/std/io/struct.BufReader.html#method.into_inner)

2. Per Rust standard pattern, I change the JSON writer so that it doesn't add any Buffering (via `BufReader`) itself, and instead allows the caller the choice of what type of buffering, if any, is needed.

3. Added / cleaned up a bunch of documentation and comments.

## Questions
I went with parameterizing the `Writer` output as a trait rather than runtime dispatch, for performance. This shouldn't have backwards compatible issues Given the writer has not yet been released yet (introduced by @houqp apache/arrow#9256)

However would people prefer a single `Writer` that took an `Options` struct or something to determine how it wrote out data?

Closes #9575 from alamb/alamb/json_arrays

Authored-by: Andrew Lamb <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants